gh-121109: Fix performance of tarfile reading with "r|*"#121296
gh-121109: Fix performance of tarfile reading with "r|*"#121296TomiBelan wants to merge 3 commits into
Conversation
| if len(t) > size: | ||
| raise ReadError("decompress() returned too much data") |
There was a problem hiding this comment.
Do you see any scenario where this would be triggered? Looking at the zlib, bz2 and lzma decompressor docs for max_length, it looks like this shouldn't occur?
I have no issue with it being here but just checking if I'm missing something :)
There was a problem hiding this comment.
Right - it can only happen if there is a bug in the zlib, bz2 or lzma decompressor. I haven't checked their C source, but their docs say it should not occur.
It's not a necessary check, but I figured I'd add it just in case.
There was a problem hiding this comment.
We make assumptions about decompressors above. That they cannot make the loops infinite (when unconsumed_tail is not empty or needs_input is false, but decompress() returns an empty bytes), or that decompress() does not produce data by tiny chunks, making the code inefficient. If we trust them there, we should trust them here. This check only distracts. I suggest to remove it or replace with assert.
|
This PR is stale because it has been open for 30 days with no activity. |
|
My dearest stale bot, I wish it was only 30 days! 😢 |
|
Re @serhiy-storchaka #121109 (comment)
It's not needed. Small reads are already well-exercised by test_tarfile.py, especially StreamReadTest. When I add a print() to _Stream._read(), it shows a variety of This becomes clearer when you realize _Stream is just the outer shell, and the tar format parser itself also needs to read small chunks sometimes. But all right. I also tested it with this script, which succeeded. rm -rf data; mkdir data; for i in 1 2 3; do head -c1M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test1M.tar.gz data ; tar caf test1M.tar.xz data ; tar caf test1M.tar.bz2 data ; tar caf test1M.tar.zst data
rm -rf data; mkdir data; for i in 1 2 3; do head -c100M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test100M.tar.gz data ; tar caf test100M.tar.xz data ; tar caf test100M.tar.bz2 data ; tar caf test100M.tar.zst dataimport sys
import tarfile
for filename in ('test1M.tar.gz', 'test1M.tar.xz', 'test1M.tar.bz2', 'test1M.tar.zst'):
for mode in ('r|*', 'r:*'):
for chunk_size in (1, 10000, 500000):
print('running:', filename, mode, chunk_size, file=sys.stderr)
with tarfile.open(filename, mode) as tf:
for tarinfo in tf:
if tarinfo.isreg():
with tf.extractfile(tarinfo) as extractf:
total = 0
while True:
buf = extractf.read(chunk_size)
if not buf: break
total += len(buf)
assert buf == b'x' * len(buf)
assert len(buf) == chunk_size or total == tarinfo.sizeFull disclosure: this script does what you asked for, but it actually isn't a very good test. extractfile() returns a io.BufferedReader. So the 1 byte read and the 10000 byte read both become 131072 byte reads. And a benchmark: import sys
import time
import tarfile
for filename in ('test100M.tar.gz', 'test100M.tar.xz', 'test100M.tar.bz2', 'test100M.tar.zst'):
for mode in ('r|*', 'r:*'):
print('running:', filename, mode, file=sys.stderr)
start = time.time()
with tarfile.open(filename, mode) as tf:
tf.list()
print('took', time.time() - start, file=sys.stderr)I got 1.3, 1.2, 1.9, 1.5, 1.1, 1.1, 0.2, 0.2 seconds. (This is a different machine than last time.) |
|
I made some changes:
|
serhiy-storchaka
left a comment
There was a problem hiding this comment.
Thank you for your benchmarks, @TomiBelan. Could you please also test the case of listing a tarfile containing a large number of files? Maybe in that case we can exercise small reads?
And please test also the case of random uncompressable data. This is far from common case, but we need to look at corners too.
Your code looks correct. Note that for zlib decompressor we still have an issue of creating new bytes objects for unconsumed_tail, especially when reading by small chunks. It is smaller than the original issue, because the size of unconsumed_tail is limited by bufsize, while dbuf in the old code could be much larger. We can treat this in separate issue if it is worth to do.
Other solution for this issue could be keeping the position in dbuf instead of modifying dbuf. The advantage of your solution is that it also limits the amount of consumed memory (dbuf can be very large). We could mitigate this by combining two approaches and using decompress(cbuf, max(bufsize, size - c)). But this can make the code more complicated, and I am not sure that it will be faster.
You can make experiments or leave this to other issue. I'll approve this PR after getting the results of new benchmarks if they are satisfying.
| if len(t) > size: | ||
| raise ReadError("decompress() returned too much data") |
There was a problem hiding this comment.
We make assumptions about decompressors above. That they cannot make the loops infinite (when unconsumed_tail is not empty or needs_input is false, but decompress() returns an empty bytes), or that decompress() does not produce data by tiny chunks, making the code inefficient. If we trust them there, we should trust them here. This check only distracts. I suggest to remove it or replace with assert.
This PR fixes #121109.
Using the test files and test script described in the issue:
test.tar.gzr:*test.tar.gzr|*test.tar.xzr:*test.tar.xzr|*test.tar.bz2r:*test.tar.bz2r|*After this PR, tf.list() of
r|*is the same speed asr:*, as expected. Not orders of magnitude slower.